Analysis of the World Happiness dataset from Kaggle: https://www.kaggle.com/unsdsn/world-happiness

Objectives: evaluate happiness levels across countries; evaluate how happiness evolved over the years; identify the most and least happy countries; identify the factors correlated with happiness.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
In [2]:
df = pd.read_csv("2019.csv")
In [3]:
df.head()
Out[3]:
Overall rank Country or region Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
0 1 Finland 7.769 1.340 1.587 0.986 0.596 0.153 0.393
1 2 Denmark 7.600 1.383 1.573 0.996 0.592 0.252 0.410
2 3 Norway 7.554 1.488 1.582 1.028 0.603 0.271 0.341
3 4 Iceland 7.494 1.380 1.624 1.026 0.591 0.354 0.118
4 5 Netherlands 7.488 1.396 1.522 0.999 0.557 0.322 0.298
In [4]:
df.shape
Out[4]:
(156, 9)

there are 156 rows and 9 columns

In [5]:
df.columns
Out[5]:
Index(['Overall rank', 'Country or region', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

list of columns

In [6]:
df.isna().sum()
Out[6]:
Overall rank                    0
Country or region               0
Score                           0
GDP per capita                  0
Social support                  0
Healthy life expectancy         0
Freedom to make life choices    0
Generosity                      0
Perceptions of corruption       0
dtype: int64

there are no missing values

In [7]:
df[["Country or region", "Score"]].sort_values(by= "Score", ascending=False).head(10)
Out[7]:
Country or region Score
0 Finland 7.769
1 Denmark 7.600
2 Norway 7.554
3 Iceland 7.494
4 Netherlands 7.488
5 Switzerland 7.480
6 Sweden 7.343
7 New Zealand 7.307
8 Canada 7.278
9 Austria 7.246

the 10 countries with the highest happiness score. 8 of the 10 are located in Europe (the exceptions are New Zealand and Canada)

In [8]:
df[["Country or region", "Score"]].sort_values(by= "Score", ascending=True).head(10)
Out[8]:
Country or region Score
155 South Sudan 2.853
154 Central African Republic 3.083
153 Afghanistan 3.203
152 Tanzania 3.231
151 Rwanda 3.334
150 Yemen 3.380
149 Malawi 3.410
148 Syria 3.462
147 Botswana 3.488
146 Haiti 3.597

the 10 countries with the lowest happiness score. 6 of the 10 are located in Africa
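
The two rankings above can also be obtained with `nlargest` and `nsmallest`, which avoid writing an explicit sort. A minimal sketch, assuming a small hand-typed sample (`sample`, a hypothetical name) copied from the outputs above rather than the full 2019.csv:

```python
import pandas as pd

# Hypothetical sample: six rows copied from the outputs above, not the full file.
sample = pd.DataFrame({
    "Country or region": ["Finland", "Denmark", "Norway", "Haiti", "Afghanistan", "South Sudan"],
    "Score": [7.769, 7.600, 7.554, 3.597, 3.203, 2.853],
})

# Highest scores come back in descending order, lowest in ascending order.
happiest = sample.nlargest(3, "Score")
least_happy = sample.nsmallest(3, "Score")

print(happiest)
print(least_happy)
```

On the real data, replacing `sample` with `df` and 3 with 10 should give the same countries as the two cells above.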

In [9]:
df[["Country or region", "GDP per capita"]].sort_values(by= "GDP per capita", ascending=False).head(10)
Out[9]:
Country or region GDP per capita
28 Qatar 1.684
13 Luxembourg 1.609
33 Singapore 1.572
20 United Arab Emirates 1.503
50 Kuwait 1.500
15 Ireland 1.499
2 Norway 1.488
5 Switzerland 1.452
75 Hong Kong 1.438
18 United States 1.433

top 10 countries by GDP per capita

In [10]:
df[["Country or region", "Perceptions of corruption"]].sort_values(by= "Perceptions of corruption", ascending=False).head(10)
Out[10]:
Country or region Perceptions of corruption
33 Singapore 0.453
151 Rwanda 0.411
1 Denmark 0.410
0 Finland 0.393
7 New Zealand 0.380
6 Sweden 0.373
5 Switzerland 0.343
2 Norway 0.341
13 Luxembourg 0.316
15 Ireland 0.310

top 10 countries by Perceptions of corruption

In [11]:
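# restrict to numeric columns: recent pandas versions raise on the text column "Country or region"
df.select_dtypes("number").corr()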
Out[11]:
Overall rank Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
Overall rank 1.000000 -0.989096 -0.801947 -0.767465 -0.787411 -0.546606 -0.047993 -0.351959
Score -0.989096 1.000000 0.793883 0.777058 0.779883 0.566742 0.075824 0.385613
GDP per capita -0.801947 0.793883 1.000000 0.754906 0.835462 0.379079 -0.079662 0.298920
Social support -0.767465 0.777058 0.754906 1.000000 0.719009 0.447333 -0.048126 0.181899
Healthy life expectancy -0.787411 0.779883 0.835462 0.719009 1.000000 0.390395 -0.029511 0.295283
Freedom to make life choices -0.546606 0.566742 0.379079 0.447333 0.390395 1.000000 0.269742 0.438843
Generosity -0.047993 0.075824 -0.079662 -0.048126 -0.029511 0.269742 1.000000 0.326538
Perceptions of corruption -0.351959 0.385613 0.298920 0.181899 0.295283 0.438843 0.326538 1.000000

The "Score" variable is strongly positively correlated with "GDP per capita" (0.79), "Social support" (0.78) and "Healthy life expectancy" (0.78). It is moderately positively correlated with "Freedom to make life choices" (0.57). There is also a very strong positive correlation between "GDP per capita" and "Healthy life expectancy" (0.84).
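
To read the relationships with "Score" directly instead of scanning the full matrix, the "Score" column of the correlation matrix can be extracted and sorted. A minimal sketch, assuming a five-row sample copied from `df.head()` above; with so few rows the coefficients will differ from those of the full dataset:

```python
import pandas as pd

# Rows copied from df.head() above (five countries only, for illustration).
sample = pd.DataFrame({
    "Country or region": ["Finland", "Denmark", "Norway", "Iceland", "Netherlands"],
    "Score": [7.769, 7.600, 7.554, 7.494, 7.488],
    "GDP per capita": [1.340, 1.383, 1.488, 1.380, 1.396],
    "Social support": [1.587, 1.573, 1.582, 1.624, 1.522],
    "Freedom to make life choices": [0.596, 0.592, 0.603, 0.591, 0.557],
})

# Keep numeric columns only, so the text column cannot break the computation.
corr = sample.select_dtypes("number").corr()

# Pearson correlation of every other variable with Score, strongest first.
score_corr = corr["Score"].drop("Score").sort_values(ascending=False)
print(score_corr)
```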

In [12]:
sns.relplot(x ="GDP per capita", y = "Score", data=df)
Out[12]:
<seaborn.axisgrid.FacetGrid at 0x7f8a4da99d30>
In [13]:
sns.relplot(x ="Social support", y = "Score", data=df)
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x7f8a4daae7c0>
In [14]:
sns.relplot(x ="Healthy life expectancy", y = "Score", data=df)
Out[14]:
<seaborn.axisgrid.FacetGrid at 0x7f8a4dcd00a0>
In [15]:
sns.relplot(x ="Perceptions of corruption", y = "Score", data=df)
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x7f8a4de19940>
In [16]:
plt.title("Distribution of happiness score")
# "shade" is deprecated in recent seaborn releases; "fill" is the current name
sns.kdeplot(df["Score"], fill=True)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a4dee34c0>
In [17]:
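# restrict to numeric columns: recent pandas versions raise on the text column
df.select_dtypes("number").kurtosis()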
Out[17]:
Overall rank                   -1.200000
Score                          -0.608375
GDP per capita                 -0.769902
Social support                  1.229005
Healthy life expectancy        -0.302895
Freedom to make life choices   -0.068857
Generosity                      1.173189
Perceptions of corruption       2.416824
dtype: float64

kurtosis of each numeric column (pandas reports excess kurtosis, so a normal distribution has a value of 0)

In [18]:
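# restrict to numeric columns: recent pandas versions raise on the text column
df.select_dtypes("number").skew()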
Out[18]:
Overall rank                    0.000000
Score                           0.011450
GDP per capita                 -0.385232
Social support                 -1.134728
Healthy life expectancy        -0.613841
Freedom to make life choices   -0.685636
Generosity                      0.745942
Perceptions of corruption       1.650410
dtype: float64

skewness of each column
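
One way to read the skewness figures is to flag the columns whose distribution is clearly asymmetric. A minimal sketch using the values printed above; the cut-off of 1 is an assumed rule of thumb, not part of the original analysis, and "Overall rank" is left out because a rank is symmetric by construction:

```python
import pandas as pd

# Skewness values copied from the output above.
skew = pd.Series({
    "Score": 0.011450,
    "GDP per capita": -0.385232,
    "Social support": -1.134728,
    "Healthy life expectancy": -0.613841,
    "Freedom to make life choices": -0.685636,
    "Generosity": 0.745942,
    "Perceptions of corruption": 1.650410,
})

# Assumed rule of thumb: |skew| > 1 counts as strongly skewed.
THRESHOLD = 1.0
strongly_skewed = skew[skew.abs() > THRESHOLD]

# Negative skew means a longer left tail, positive skew a longer right tail.
direction = strongly_skewed.apply(lambda s: "left tail" if s < 0 else "right tail")
print(direction)
```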

In [19]:
#PCA analysis (dimensionality reduction)
In [20]:
df1 = pd.read_csv("happinesscode.csv", delimiter=";")

I imported a version of the dataset that also contains the ISO country code of each country
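
How happinesscode.csv was built is not shown in the notebook. One possible way to attach ISO codes is a left merge against a lookup table, with `indicator=True` to list the country names that found no match. A minimal sketch; `scores` and `iso_lookup` are hypothetical stand-ins, and only the "Country Code" column name is taken from the cells that follow:

```python
import pandas as pd

# Scores copied from the outputs above.
scores = pd.DataFrame({
    "Country or region": ["Finland", "Denmark", "South Sudan"],
    "Score": [7.769, 7.600, 2.853],
})

# Hypothetical lookup table; it deliberately lacks one country.
iso_lookup = pd.DataFrame({
    "Country or region": ["Finland", "Denmark"],
    "Country Code": ["FIN", "DNK"],
})

# A left merge keeps every scored country; "_merge" records whether a code was found.
merged = scores.merge(iso_lookup, on="Country or region", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Country or region"].tolist()
print(unmatched)
```

Countries listed in `unmatched` would be left blank on the map, so they are worth fixing in the lookup before plotting.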

In [21]:
fig = px.choropleth(df1, locations="Country Code",
                    color="Score", 
                    hover_name="Country or region", 
                    color_continuous_scale=px.colors.sequential.Plasma, title="Happiness Score 2019")
fig.show()
In [22]:
df2015 = pd.read_csv("2015.csv")
df2016 = pd.read_csv("2016.csv")
df2017 = pd.read_csv("2017.csv")
df2018 = pd.read_csv("2018.csv")
df2019 = pd.read_csv("2019.csv")
In [23]:
df2015["Year"] = 2015
df2016["Year"] = 2016
df2017["Year"] = 2017
df2018["Year"] = 2018
df2019["Year"] = 2019
In [24]:
df2015.rename(columns={
    "Country": "Country or region",
    "Happiness Rank": "Overall rank",
    "Happiness Score": "Score",
    "Economy (GDP per Capita)": "GDP per capita",
    "Family": "Social support",
    "Health (Life Expectancy)": "Healthy life expectancy",
    "Freedom": "Freedom to make life choices",
    "Trust (Government Corruption)": "Perceptions of corruption",
}, inplace=True)
df2015.drop(columns=["Region", "Standard Error", "Dystopia Residual"], inplace=True)
In [25]:
df2016.rename(columns={"Country": "Country or region", "Happiness Rank": "Overall rank", "Happiness Score": "Score", "Economy (GDP per Capita)":"GDP per capita", "Family": "Social support", "Health (Life Expectancy)": "Healthy life expectancy", "Freedom": "Freedom to make life choices", "Trust (Government Corruption)": "Perceptions of corruption"}, inplace=True)
df2016.drop(columns=["Region", "Dystopia Residual", "Lower Confidence Interval", "Upper Confidence Interval"], inplace=True)
In [26]:
df2017.rename(columns={
    "Country": "Country or region",
    "Happiness.Rank": "Overall rank",
    "Happiness.Score": "Score",
    "Economy..GDP.per.Capita.": "GDP per capita",
    "Family": "Social support",
    "Health..Life.Expectancy.": "Healthy life expectancy",
    "Freedom": "Freedom to make life choices",
    "Trust..Government.Corruption.": "Perceptions of corruption",
}, inplace=True)
df2017.drop(columns=["Dystopia.Residual", "Whisker.low", "Whisker.high"], inplace=True)
In [27]:
# keep the same columns, in the same order, for every year
cols = ["Overall rank", "Country or region", "Score", "GDP per capita", "Social support", "Healthy life expectancy", "Freedom to make life choices", "Generosity", "Perceptions of corruption", "Year"]
df2015, df2016, df2017, df2018, df2019 = (d[cols] for d in (df2015, df2016, df2017, df2018, df2019))
In [28]:
df_all_years = pd.concat([df2015,df2016,df2017,df2018,df2019])
df_all_years.to_excel("happiness_all_years.xlsx")
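
The per-year renaming above follows one pattern: rename the year-specific headers, tag the year, keep the shared columns, then concatenate. A minimal sketch of that pattern as a helper; `harmonise`, `COMMON_COLUMNS` and the toy frames are hypothetical, and the toy scores are made up:

```python
import pandas as pd

COMMON_COLUMNS = ["Country or region", "Score", "Year"]

def harmonise(frame, year, rename_map):
    """Rename year-specific columns, tag the year, and keep the shared columns."""
    out = frame.rename(columns=rename_map)  # returns a copy, the input is untouched
    out["Year"] = year
    return out[COMMON_COLUMNS]

# Toy stand-ins for two yearly files with different headers (made-up values).
raw_2015 = pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.0, 6.0], "Region": ["X", "Y"]})
raw_2017 = pd.DataFrame({"Country": ["A", "B"], "Happiness.Score": [7.1, 5.9]})

frames = {2015: raw_2015, 2017: raw_2017}
rename_maps = {
    2015: {"Country": "Country or region", "Happiness Score": "Score"},
    2017: {"Country": "Country or region", "Happiness.Score": "Score"},
}

all_years = pd.concat(
    [harmonise(frames[y], y, rename_maps[y]) for y in frames],
    ignore_index=True,
)
print(all_years)
```

Keeping the mappings in one dictionary makes a mistyped header easier to spot than in several separate `rename` calls.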
In [29]:
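# NOTE: this semicolon-delimited CSV is not written by the cell above; it is presumably the
# Excel export edited outside the notebook to add the "Country Code" column used below.
# "#N/D" entries are read as missing values.
df_all_years = pd.read_csv("happiness_all_years.csv", delimiter=";", na_values=["#N/D"])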
In [30]:
df_all_years.drop(columns="Unnamed: 0", inplace=True)
In [31]:
df_all_years.dropna(inplace=True)
In [32]:
df_all_years = df_all_years[df_all_years.Year != 2017]
In [33]:
df_all_years["Score"] = df_all_years.Score.astype(float)
In [34]:
px.choropleth(df_all_years,               
              locations="Country Code",               
              color="Score",
              hover_name="Country or region",  
              animation_frame="Year",    
              color_continuous_scale='Plasma',  
              height=500)

the animated map suggests an overall decrease in happiness scores between 2015 and 2019: countries appear to have become less happy over this period
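
An impression taken from a map can be checked numerically by comparing the mean score per year, restricted to countries present in every year so that the means compare like with like. A minimal sketch on a made-up toy table (`toy`, hypothetical); it does not reproduce the notebook's figures:

```python
import pandas as pd

# Toy long-format table with made-up scores; country C is missing in 2019.
toy = pd.DataFrame({
    "Country or region": ["A", "B", "C", "A", "B"],
    "Year": [2015, 2015, 2015, 2019, 2019],
    "Score": [6.0, 4.0, 8.0, 5.5, 4.0],
})

# Keep only the countries observed in every year.
n_years = toy["Year"].nunique()
years_per_country = toy.groupby("Country or region")["Year"].nunique()
complete = years_per_country[years_per_country == n_years].index
balanced = toy[toy["Country or region"].isin(complete)]

# Mean score per year and the change between the first and last year.
yearly_mean = balanced.groupby("Year")["Score"].mean()
change = yearly_mean.loc[2019] - yearly_mean.loc[2015]
print(yearly_mean)
print(change)
```

On the real table the same steps would apply to `df_all_years`; a negative `change` would support the reading of the map.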

In [35]:
#PCA ANALYSIS

plot_zoom.png

the first three principal components together explain 83% of the total variability
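
The PCA appears in this notebook only as images, so the tool and settings used are not known. A minimal NumPy sketch of how a cumulative explained-variance figure is obtained, assuming a PCA on standardised variables (the correlation matrix) and run on random stand-in data (`X`, hypothetical), so its numbers will not match the 83% reported above:

```python
import numpy as np

# Stand-in data: 156 rows and 6 variables of random noise, not the happiness columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(156, 6))

# PCA on standardised variables uses the eigenvalues of the correlation matrix.
corr = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order; reverse it

# Each eigenvalue's share of the total is the variance explained by that component.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)
print(cumulative[:3])  # share explained by the first three components
```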

Schermata 2020-11-24 alle 20.06.52.png

The first three principal components. The first principal component is mainly an average of the variables. The second principal component contrasts freedom to make life choices, generosity and perceptions of corruption with the other three variables (GDP per capita, social support, healthy life expectancy). The third principal component contrasts perceptions of corruption with generosity and social support.

plot_zoom.png

Result of the PCA:
1) Considering the first principal component (x axis): on the right we find countries with high values for all the variables (apart from generosity); on the left we find countries with low values for nearly all the variables.
2) Considering the second principal component (y axis): at the bottom we find countries with high values for generosity, perceptions of corruption and freedom to make life choices relative to the other variables; at the top we find countries with high values for GDP per capita, social support and healthy life expectancy relative to the other variables.
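
The positions read off the plot are component scores: each country's standardised values weighted by the component loadings. A minimal NumPy sketch of that projection, again on random stand-in data (`X`, hypothetical) rather than the happiness variables, and not necessarily the procedure used for the figures above:

```python
import numpy as np

# Stand-in data: 156 rows and 6 variables of random noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(156, 6))

# Standardise each column, then take the SVD: rows of Vt are the component loadings.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

loadings = Vt[:2].T    # shape (6, 2): one row per variable
scores = Z @ loadings  # shape (156, 2): one row per country (x and y on the plot)

# A country's position on PC1 is the loading-weighted sum of its standardised values.
first_country_pc1 = float(Z[0] @ loadings[:, 0])
print(scores[0], first_country_pc1)
```

The sign of each component is arbitrary, so another tool may draw the same plot mirrored left to right or top to bottom; the contrasts between variables stay the same.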